[Dashboard widgets: a sampling table with columns Group (i), πθ_old(oi|q), πθ(oi|q), and Reward (ri); an advantage table with columns Group (i) and Advantage (Ai); a clipping table with columns Group (i), Ratio, Clipped Ratio, Unclipped Term, Clipped Term, and Min(Terms); and readouts for the KL divergence (how much the current policy differs from the reference), the KL penalty (β × KL divergence), the surrogate objective (average of the clipped terms), and the final objective (surrogate − KL penalty).]
The policy ratio πθ(oi|q) / πθ_old(oi|q) measures how much more (or less) likely the new policy is to select output oi compared to the old policy.
Clipping this ratio to the range [1-ε, 1+ε] prevents large policy updates, ensuring stability. This is a hallmark feature of Proximal Policy Optimization (PPO).
When the advantage Ai is positive, clipping limits how much the policy can improve for that action. When Ai is negative, clipping limits how much the policy can decrease the probability of that action.
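As a rough sketch of what the Ratio, Clipped Ratio, and Min(Terms) columns compute, here is the same step in Python; the function name, argument names, and the default ε = 0.2 are illustrative assumptions, not values taken from the paper.

```python
def clipped_terms(p_new, p_old, advantages, eps=0.2):
    """For each output o_i: min(ratio * A_i, clip(ratio, 1 - eps, 1 + eps) * A_i)."""
    terms = []
    for pn, po, A in zip(p_new, p_old, advantages):
        ratio = pn / po                                  # pi_theta(o_i|q) / pi_theta_old(o_i|q)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # restrict the ratio to [1 - eps, 1 + eps]
        terms.append(min(ratio * A, clipped * A))        # keep the more pessimistic term
    return terms
```

Taking the minimum of the two terms is what makes the update conservative: the policy never gets extra credit for moving further than the clip range allows.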
The advantage function Ai quantifies how good or bad selecting output oi is compared to the average.
It's calculated by normalizing the rewards to zero mean and unit variance: Ai = (ri - mean(rewards)) / std(rewards)
This normalization stabilizes training across varying reward scales.
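A toy Python version of that normalization could look like the sketch below; whether to use the population or the sample standard deviation, and how to treat a group where every reward is identical, are left open by the formula above, so the choices here are just one option.

```python
import statistics

def group_advantages(rewards):
    """A_i = (r_i - mean(rewards)) / std(rewards), computed within one sampled group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # fall back to 1.0 if every reward is identical
    return [(r - mu) / sigma for r in rewards]
```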
The KL Divergence penalty encourages the policy πθ to stay close to a reference policy πref.
The formula is: DKL(πθ || πref) = Σ πθ(oi|q) log(πθ(oi|q) / πref(oi|q))
In this dashboard, we're using the old policy as the reference policy for simplicity.
The hyperparameter β controls the strength of this penalty. Higher values of β discourage large policy changes.
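A small sketch of that penalty term, summing only over the sampled outputs of the group (as the dashboard's tables do) rather than the full output distribution, with β passed in explicitly:

```python
import math

def kl_penalty(p_theta, p_ref, beta):
    """beta * sum_i p_theta(o_i|q) * log(p_theta(o_i|q) / p_ref(o_i|q))."""
    kl = sum(p * math.log(p / q) for p, q in zip(p_theta, p_ref))
    return beta * kl
```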
The final GRPO objective combines the clipped surrogate objective with the KL divergence penalty:
JGRPO(θ) = Eq[ (1/G) Σi min(ratio × Ai, clipped_ratio × Ai) − β × DKL ]
The algorithm aims to maximize this objective function, which means raising the probability of outputs with above-average rewards while keeping the policy close to the reference.
This balance leads to stable and efficient policy improvement.
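To see the whole objective in one place, here is a self-contained toy calculation for a single question q; the probabilities, rewards, β = 0.04, and ε = 0.2 below are made-up illustrative values, not numbers from the paper or the dashboard.

```python
import math, statistics

def grpo_objective(p_new, p_old, rewards, beta=0.04, eps=0.2):
    """(1/G) * sum_i min(ratio * A_i, clip(ratio) * A_i) - beta * D_KL(new || old)."""
    mu, sigma = statistics.mean(rewards), statistics.pstdev(rewards) or 1.0
    advantages = [(r - mu) / sigma for r in rewards]             # group-relative advantages
    surrogate = 0.0
    for pn, po, A in zip(p_new, p_old, advantages):
        ratio = pn / po
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        surrogate += min(ratio * A, clipped * A)
    surrogate /= len(rewards)                                    # average over the group
    kl = sum(p * math.log(p / q) for p, q in zip(p_new, p_old))  # old policy as the reference
    return surrogate - beta * kl

# Example group of G = 4 outputs: uniform old policy, new policy favoring the
# highest-reward output.
print(grpo_objective([0.40, 0.30, 0.20, 0.10], [0.25, 0.25, 0.25, 0.25], [1.0, 0.5, 0.0, 0.0]))
```

In this example the new policy shifts probability toward the highest-reward output, so the averaged clipped term is positive and the small KL penalty only slightly reduces the objective.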
This interactive explorer was created with Claude 3.7 Sonnet. Here is the prompt, with the DeepSeek-R1 paper (summarized below) attached: "make an interactive dashboard in html/js/bootstrap to understand the GRPO with connected widgets in a dynamic way."
By Nicolas Martin - Fractal-Apps - 4/2025
Large Language Models (LLMs) are rapidly advancing, bringing us closer to the vision of Artificial General Intelligence (AGI). A critical aspect of this evolution is enhancing their reasoning capabilities. While approaches like increasing Chain-of-Thought (CoT) length during inference have shown promise, the challenge of effective test-time scaling remains. DeepSeek-AI introduces DeepSeek-R1 and DeepSeek-R1-Zero, their first-generation reasoning models, which explore the power of reinforcement learning (RL) to push the boundaries of LLM reasoning.
The journey began with DeepSeek-R1-Zero, a model trained through large-scale reinforcement learning without the initial step of supervised fine-tuning (SFT). This was a significant undertaking, aiming to investigate the potential for LLMs to develop reasoning abilities purely through self-evolution via RL. Using DeepSeek-V3-Base as the foundation and employing Group Relative Policy Optimization (GRPO) as the RL framework, DeepSeek-R1-Zero remarkably demonstrated the emergence of powerful and intriguing reasoning behaviors. This included capabilities like self-verification, reflection, and generating lengthy Chains of Thought, validating that RL alone can incentivize reasoning.
At the heart of DeepSeek-R1's training lies Group Relative Policy Optimization (GRPO). This RL algorithm is designed to be more computationally efficient by foregoing a critic model, typically the same size as the policy model. Instead, GRPO estimates a baseline from group scores.
Imagine you're trying to teach a model to solve a complex problem through trial and error. In traditional RL, a critic would evaluate each attempt individually. GRPO, however, takes a different approach. For a given problem, it samples a group of potential solutions generated by the model's current strategy. It then evaluates the quality of each solution within that group, and importantly, calculates an "advantage" for each solution based on how much better or worse it is compared to the average performance of the group. This group-relative comparison helps the model understand which variations in its approach led to better outcomes, allowing it to refine its strategy more efficiently.
To truly grasp GRPO, let's conceptualize an interactive explorer: a dashboard where you can adjust key variables (the per-output rewards, the clipping range ε, and the KL penalty weight β) and observe their impact on the learning process.
This interactive exploration highlights the nuanced interplay of factors within GRPO that contribute to the model's learning and the emergence of complex reasoning abilities.
While DeepSeek-R1-Zero showcased the power of pure RL, it faced challenges, notably poor readability and language mixing in its outputs. To address these and further boost performance, DeepSeek-R1 was introduced, incorporating a multi-stage training pipeline with a "cold start".
The cold start involved fine-tuning the DeepSeek-V3-Base model with a small dataset of high-quality, human-friendly Chain-of-Thought examples. This provided an initial boost in readability and guided the model towards a more desirable output format. The training then proceeded with reasoning-oriented RL, similar to DeepSeek-R1-Zero, but with the addition of a language consistency reward to mitigate mixing issues.
The pipeline didn't stop there. After the reasoning-oriented RL converged, a stage of rejection sampling and supervised fine-tuning was introduced. This involved generating new SFT data by sampling from the RL checkpoint and combining it with supervised data from other domains like writing and factual QA. This aimed to enhance the model's general capabilities alongside its reasoning prowess. Finally, a second RL stage was implemented, considering prompts from all scenarios to further align the model with human preferences for helpfulness and harmlessness.
This multi-stage approach of DeepSeek-R1, starting with a cold start and iteratively refining through RL and SFT, demonstrates an ingenious strategy to combine the exploration power of RL with the guidance of supervised data to achieve both high performance and user-friendly outputs.
DeepSeek-AI didn't just focus on training large, powerful models; they also explored how to make these advancements more accessible. A key innovation is the distillation of DeepSeek-R1's reasoning capabilities into smaller dense models. By fine-tuning models like Qwen and Llama using data curated with DeepSeek-R1, they demonstrated that the reasoning patterns learned by larger models can be effectively transferred to smaller ones.
This distillation process is significant because it allows smaller, more efficient models to achieve reasoning performance comparable to larger models. The open-sourcing of these distilled models contributes valuable resources to the research community, enabling broader access to advanced reasoning capabilities.
DeepSeek-R1 has demonstrated impressive performance across a range of benchmarks. It achieved results comparable to OpenAI-o1-1217 on reasoning tasks like AIME 2024 and MATH-500, and reached an expert-level rating on Codeforces for coding tasks. Beyond reasoning, DeepSeek-R1 also excels in knowledge-based tasks and general-purpose applications like creative writing and summarization. The distilled smaller models also set new records on reasoning benchmarks among dense models.
The work on DeepSeek-R1 and DeepSeek-R1-Zero highlights the immense potential of reinforcement learning in developing advanced reasoning abilities in LLMs. By open-sourcing their models and detailing their training methodologies, DeepSeek-AI provides valuable insights and resources for the research community to continue exploring the frontiers of AI reasoning. The ingenuity of their multi-stage training pipeline and the effectiveness of their distillation process pave the way for more capable and accessible language models in the future.